Add auto-resume feature and bump version (0.3.31) #142
Merged
aliroberts merged 1 commit into main on Apr 24, 2026
Summary
Wraps the optimization loop so that transient network failures (ConnectionError, ReadTimeout, HTTP 502/503/504) auto-resume the run instead of bailing out. Enabled by default, with CLI overrides. Non-transient failures (auth, 4xx, insufficient credits, Ctrl-C) still propagate unchanged. Bumps version to 0.3.31.
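As a rough illustration of the transient/non-transient split, the sketch below classifies a failure by its reason tag. The four transient tags are the ones listed in the implementation notes; the helper name `is_transient` and the `http_404` tag are assumptions made for this example, not names taken from the weco codebase.

```python
# Reason tags treated as transient, as listed in this PR's implementation notes.
TRANSIENT_REASONS = frozenset(
    {"transient_network_error", "http_502", "http_503", "http_504"}
)


def is_transient(reason: str) -> bool:
    """Hypothetical helper: True if a loop exit reason should trigger auto-resume."""
    return reason in TRANSIENT_REASONS


# "unknown" is the generic bucket mentioned in the PR; "http_404" is an assumed
# tag, formed by analogy with the http_5xx tags, standing in for a 4xx failure.
for reason in ("http_503", "transient_network_error", "unknown", "http_404"):
    print(reason, is_transient(reason))
```

Anything outside the set falls through to the existing error path, which matches the PR's statement that non-transient failures propagate unchanged.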
CLI
weco run ... [--no-auto-resume] [--auto-resume-max-attempts N]
weco resume ... [--no-auto-resume] [--auto-resume-max-attempts N]
Defaults: enabled, 5 attempts, 5s initial backoff, exponential (×2) capped at 60s.
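To make the default schedule concrete, here is a minimal sketch of the backoff arithmetic. The class `BackoffSketch` and its field names are hypothetical and are not the real `AutoResumePolicy` fields; the sketch also assumes the first attempt waits the full initial backoff, which the PR does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffSketch:
    """Hypothetical stand-in for the policy; field names are assumptions."""

    enabled: bool = True
    max_attempts: int = 5
    initial_backoff_s: float = 5.0
    multiplier: float = 2.0
    max_backoff_s: float = 60.0

    def backoff_for(self, attempt: int) -> float:
        """Delay in seconds before the given 1-based attempt, capped at the maximum."""
        raw = self.initial_backoff_s * self.multiplier ** (attempt - 1)
        return min(raw, self.max_backoff_s)


policy = BackoffSketch()
schedule = [policy.backoff_for(a) for a in range(1, policy.max_attempts + 1)]
print(schedule)  # [5.0, 10.0, 20.0, 40.0, 60.0]
```

Under these assumptions the fifth delay would be 80s uncapped, so the 60s cap only takes effect on the last default attempt.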
Implementation notes
_run_loop_with_auto_resume in weco/optimizer.py drives run_optimization_loop as a closure. On transient exit it sleeps with exponential backoff, calls WecoClient.resume_run silently (_silent_resume), and re-enters the loop with start_step = result.final_step. Non-transient results return verbatim.
run_optimization_loop now catches ConnectionError/ReadTimeout explicitly and tags them transient_network_error instead of landing in the generic unknown bucket. HTTPError branch is unchanged; transient classification uses reason ∈ {transient_network_error, http_502, http_503, http_504}.
_silent_resume failures retry in-place (don't re-invoke the loop), so when the backend is unreachable we don't spin in get_execution_tasks for 10 minutes between resume attempts.
on_reconnecting(attempt, max, backoff_s) / on_reconnected() on the OptimizationUI protocol. Rich UI adds a reconnecting status (📡, yellow) with attempt/backoff in the status row; plain UI prints [RECONNECTING] / [RECONNECTED] lines. Exhaustion routes through on_error so it lands in the prominent Error row.
AutoResumePolicy dataclass carries the overrides; both optimize() and resume_optimization() accept one and default to AutoResumePolicy() when absent.
Tests
tests/test_auto_resume.py: 21 tests covering classification across 12 reasons, happy path, transient-then-success, exhaustion, disabled policy, _silent_resume failure retries without re-invoking the loop, exponential backoff with cap, and event payload shape.
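The control flow described in the implementation notes, and the "silent resume failure retries without re-invoking the loop" case in particular, can be sketched roughly as follows. This is not the code in weco/optimizer.py: `LoopResult`, `run_with_auto_resume`, the callable parameters, and the `completed` reason are hypothetical, and the sketch assumes the attempt counter is never reset after a successful reconnect, which the PR does not specify.

```python
import time
from dataclasses import dataclass
from typing import Callable

TRANSIENT = frozenset({"transient_network_error", "http_502", "http_503", "http_504"})


@dataclass
class LoopResult:
    """Hypothetical result shape: why the loop exited and the last step reached."""

    reason: str
    final_step: int


def run_with_auto_resume(
    run_loop: Callable[[int], LoopResult],
    silent_resume: Callable[[], bool],
    max_attempts: int = 5,
    initial_backoff_s: float = 5.0,
    max_backoff_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> LoopResult:
    result = run_loop(0)
    attempt = 0
    while result.reason in TRANSIENT and attempt < max_attempts:
        attempt += 1
        sleep(min(initial_backoff_s * 2 ** (attempt - 1), max_backoff_s))
        if not silent_resume():
            # Resume failed: back off and retry the resume without re-entering the loop.
            continue
        # Resume succeeded: re-enter the loop from the last completed step.
        result = run_loop(result.final_step)
    # Non-transient results are returned verbatim; on exhaustion the last
    # transient result is returned for the caller to report.
    return result


# Demo with fakes: one transient exit, one failed resume, then success.
results = iter([LoopResult("http_503", 3), LoopResult("completed", 10)])
resumes = iter([False, True])
starts: list[int] = []
delays: list[float] = []


def fake_loop(start_step: int) -> LoopResult:
    starts.append(start_step)
    return next(results)


final = run_with_auto_resume(fake_loop, lambda: next(resumes), sleep=delays.append)
print(final.reason, starts, delays)
```

Injecting `sleep` as a parameter is a choice made for this sketch so the backoff can be observed without waiting; in the demo the loop is entered only twice even though two resume attempts were made.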